1 Setup

In this notebook we use the following R packages: tidyverse and ggplot2. In addition we use the multiplot functionality provided by Cookbook for R. In Python we use fastai, numpy and pandas. Full code for training and inference is available on GitHub. All code for statistical analysis is included in this document. By default all code chunks are hidden. They can easily be expanded by clicking the Code button.

2 Introduction

The success of deep learning models has typically been measured in terms of their predictive power, but they have lacked a principled way of expressing uncertainty about these predictions.

In my master’s thesis we apply recent insights connecting dropout neural networks to Bayesian approximate variational inference (VI). VI is a technique for approximating the intractable integrals that arise when modelling the posterior distribution in Bayesian inference and machine learning. Gal et al. [1, 2, 3] have shown that the posterior distribution of a NN can be approximated by leaving dropout on at test time and sampling multiple predictions. This amounts to drawing Monte Carlo (MC) samples from the posterior (the process is briefly described in section 2.1). The Bayesian framework allows us to obtain principled uncertainty estimates when making predictions in these so-called MC dropout NNs.

The results of [1,2,3] seem particularly promising when applied to healthcare. In fact, a recent paper published in Nature [4] applies MC dropout to capture uncertainty estimates when predicting the presence of diabetic retinopathy in patients. This paper demonstrates how uncertainty estimates can provide useful information to the clinician tasked with interpreting the results of medical images. The key idea is that this human-machine interaction will lead to overall better results than either could produce individually.

In this notebook I will first briefly introduce the general idea behind MC dropout. Then we will apply MC dropout in a classification task based on the collection of labelled images in the CIFAR-10 data set.

2.1 Background

A neural network is made up of many neurons which are organized in layers. Neurons are often called nodes or units, depending on your choice of machine learning literature. Henceforth we will refer to them as units.

In a typical neural network there are many, many units. As a consequence the number of parameters greatly exceeds the number of data points, dramatically increasing the risk of overfitting (i.e. there are many settings of weights which will cause the network to fit the data almost perfectly).

Dropout is a common stochastic regularisation technique [5] that is used to prevent overfitting in neural networks. The term dropout refers to randomly “dropping out” non-output units, temporarily removing all of their connections to the rest of the network. The main idea is that if the presence of other units is unreliable, each unit is forced to learn how to be useful on its own. At test time all units are activated, and the weights of the network are scaled by the retention probability so that the expected output matches the one seen during training.
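The test-time scaling can be illustrated with a short numpy sketch (an illustration, not part of the thesis code): averaging a unit's output over many dropout masks recovers the output obtained by keeping every unit and scaling by the retention probability \(1-p\):

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.ones(10)          # activations of ten units
rate = 0.5               # dropout rate p: fraction of units dropped

# Training time: each unit is kept with probability 1 - p.
masks = rng.random((20000, 10)) >= rate
train_mean = (masks * x).mean(axis=0)   # output averaged over many dropout masks

# Test time: all units active, outputs scaled by 1 - p.
test_out = (1 - rate) * x

# The two agree in expectation, since E[mask] = 1 - p.
print(np.abs(train_mean - test_out).max())  # small sampling error only
```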

Recent work by Gal et al. [1, 2, 3] casts dropout neural networks as approximate Bayesian inference. Their results show that the predictive posterior distribution of a neural network can be approximated by leaving dropout on at test time.

Consider a classification setting, such as in CIFAR-10. Essentially, what happens when applying MC dropout is the following:

  • An image is fed forward through the network \(T\) times (we use \(T=100\) as recommended in [4]). Each time the image is fed through is called a stochastic forward pass.
  • For each stochastic forward pass, a slightly different network is making predictions because dropout has randomly switched off units.
  • As a result the \(T=100\) stochastic forward passes return 100 slightly different vectors of class predictions.
  • To make a prediction we average the 100 vectors. The class corresponding to the largest element in the resulting vector is our final prediction.
  • Finally we calculate the standard deviation in class predictions over all forward passes. This is our estimated uncertainty.

Mathematically, model uncertainty is approximated by the empirical standard deviation of the predictions for class \(k\), i.e. \[\hat{\sigma}_k = \sqrt{\frac{\sum_{t=1}^T [p_t(k|X,w) - \hat{\mu}_k]^2}{T-1}}\] where \[\hat{\mu}_k = \frac{1}{T}\sum_{t=1}^T p_t(k|X,w)\] is the average of the softmax outputs for class \(k\) over the \(T\) forward passes.
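The sampling procedure and the two estimators can be sketched in a few lines of numpy (an illustration; `softmax_samples` stands for a \((T, K)\) array holding one softmax vector per stochastic forward pass):

```python
import numpy as np

def mc_dropout_summary(softmax_samples):
    """Summarise T stochastic forward passes.

    softmax_samples: (T, K) array, one softmax vector per forward pass.
    Returns the predicted class, its mean softmax output (mu_hat_k)
    and the empirical standard deviation (sigma_hat_k)."""
    mu_hat = softmax_samples.mean(axis=0)            # mu_hat_k for every class k
    prediction = int(mu_hat.argmax())                # final prediction
    sigma_hat = softmax_samples.std(axis=0, ddof=1)  # ddof=1 matches the T-1 denominator
    return prediction, mu_hat[prediction], sigma_hat[prediction]
```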

Gal et al. [3] show that the above amounts to drawing Monte Carlo samples from the predictive posterior. Their work demonstrates that applying dropout is effectively the same as defining a Bayesian neural network with a Bernoulli approximating variational distribution over the parameters. Gal has written an excellent blog post that introduces the work and the derivation.

2.2 Problem statement

The derivation in Gal et al. [2] is based on the observation that a shallow neural network with infinitely many weights converges to a Gaussian process (GP) with a specific covariance function. A GP is a non-parametric, probabilistic machine learning method. The idea is that we place a prior distribution over the function space, and by observing new data points we can infer which functions are most likely to have generated the observed data. A GP is fully specified by its mean and covariance functions, and allows us to obtain uncertainty estimates [1]. The extension of this result to dropout neural networks is the main contribution of Gal et al. [2].

In [1], however, Gal et al. extend this idea further to include priors over the weights of the convolutional layers in a convolutional neural network (CNN). A CNN is a specialized type of neural network, used primarily and very effectively for image analysis. The authors briefly state that the GP interpretation is lost when the model is extended to CNNs, but that these networks can still be “modelled as Bayesian”.

As far as we are aware, there have been no efforts to determine the correlation between the approximated uncertainty and a network’s ability to predict correctly. Moreover, the work done so far seems to focus on the uncertainty associated with the predicted class. We will examine if there is any additional information to be gained from establishing a connection between the prediction and the runner-up prediction. In other words, our problem statement is:

Are the uncertainty approximations obtained by applying Monte Carlo dropout to convolutional neural networks a reasonable measure of model uncertainty?

2.3 Experimental setup

State-of-the-art architectures such as ResNet and DenseNet are very powerful, but they are also complicated and their inner workings are quite convoluted. We are primarily interested in examining the correlation between uncertainty estimation and predictive capabilities. It is arguably better to use a simple network architecture to illustrate the idea of MC dropout. In our approach we use LeNet-5. Click on the tabs for implementation details. Full code is available on GitHub.

2.3.1 Model

LeNet-5 was a pioneering 7-layer convolutional neural network originally developed by Yann LeCun in 1998 for handwritten digit recognition. It is hopelessly primitive compared to contemporary architectures, but it still captures the gist of what a convolutional network is while remaining simple enough to allow us to understand every building block of the network. The following chunk shows the model architecture. Toggle the Code button to the right to view.

# Building simple Bayesian CNN with MC dropout and SGD optimizer
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import SGD

def lenet_all(input_shape, nb_classes, p=0.5):

    # Building model
    model=Sequential([
        Conv2D(input_shape=input_shape, filters=192, kernel_size=(5,5)), # 2D convolution layer, e.g. spatial convolution over images.
        DropoutMC(p),
        MaxPooling2D(strides=2), # Max pooling operation for spatial data
                
        Conv2D(192, kernel_size=(5,5)), # 2D convolution layer, e.g. spatial convolution over images.
        DropoutMC(p),
        MaxPooling2D(strides=2), # Max pooling operation for spatial data
                    
        Flatten(), # Flattens the input
        Dense(1000, activation="relu"), # Fully connected layer
        DropoutMC(p),
        Dense(nb_classes, activation="softmax") # Output layer
    ]) 
    
    # Compiling model
    model.compile(SGD(momentum=0.9, decay=0.0005), # Optimiser
                  loss="categorical_crossentropy", # Minimisation objective
                  metrics=["accuracy"]) # Evaluation metric
    
    return model


2.3.2 MC dropout layer

Our specification of LeNet-5 differs from the original in one crucial way: we use Monte Carlo dropout layers. MC dropout is not a feature that is implemented in Keras, and we must therefore implement one ourselves. Fortunately, this amounts to a simple adjustment of existing code: we modify the Keras Dropout class so that the training argument of its call method defaults to True, keeping dropout active at test time. Toggle the Code button to the right to view.

from keras import backend as K
from keras.layers import Layer
from keras.legacy import interfaces

class DropoutMC(Layer):
    """Applies MC Dropout to the input.
    Dropout consists in randomly setting
    a fraction `rate` of input units to 0 at each update during training time,
    which helps prevent overfitting.
    MC Dropout consists in sampling from the posterior predictive distribution by activating dropout at test time.
    # Arguments
        rate: float between 0 and 1. Fraction of the input units to drop.
        noise_shape: 1D integer tensor representing the shape of the
            binary dropout mask that will be multiplied with the input.
            For instance, if your inputs have shape
            `(batch_size, timesteps, features)` and
            you want the dropout mask to be the same for all timesteps,
            you can use `noise_shape=(batch_size, 1, features)`.
        seed: A Python integer to use as random seed.
    # References
        - [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf)
    """
    @interfaces.legacy_dropout_support
    def __init__(self, rate, noise_shape=None, seed=None, **kwargs):
        super(DropoutMC, self).__init__(**kwargs)
        self.rate = min(1., max(0., rate))
        self.noise_shape = noise_shape
        self.seed = seed
        self.supports_masking = True

    def _get_noise_shape(self, inputs):
        if self.noise_shape is None:
            return self.noise_shape

        symbolic_shape = K.shape(inputs)
        noise_shape = [symbolic_shape[axis] if shape is None else shape
                       for axis, shape in enumerate(self.noise_shape)]
        return tuple(noise_shape)

    def call(self, inputs, training=True):  # training defaults to True: dropout stays active at test time
        if 0. < self.rate < 1.:
            noise_shape = self._get_noise_shape(inputs)

            def dropped_inputs():
                return K.dropout(inputs, self.rate, noise_shape,
                                 seed=self.seed)
            return K.in_train_phase(dropped_inputs, inputs,
                                    training=training)
        return inputs

    def get_config(self):
        config = {'rate': self.rate,
                  'noise_shape': self.noise_shape,
                  'seed': self.seed}
        base_config = super(DropoutMC, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def compute_output_shape(self, input_shape):
        return input_shape


2.3.3 Inference

We need to define a function that performs inference over our input. The following chunk contains the relevant code for the sampling procedure described in section 2.1. The function inference stores all the relevant statistics and softmax distributions in a dictionary named output. The results are then turned into a pandas dataframe and some very basic feature engineering is performed. Finally, the data is prepared for statistical analysis in R. Toggle the Code button to the right to view.

def batch(img, T=100):
    ''' Create a mini-batch of T identical images for use in inference.

        Arguments:
        img, numpy array
        T, number of stochastic forward passes (i.e. number of times the image is repeated)
        '''
    return np.repeat(img[np.newaxis], T, axis=0)

def inference(model, X, y, T=100, normalize=False):    
    ''' Function that performs inference by applying MC dropout with T stochastic forward passes on given input.
        
        Arguments:
        model, a model object
        X, images to be classified
        y, corresponding labels,
        T, number of stochastic forward passes
        '''
    # Get images, labels
    imgs, labels = X, y.argmax(axis=1)
    
    if normalize:
        # Preprocessing input
        imgs = imgs.astype("float32")
        imgs -= np.mean(X)
        imgs /= np.std(X, axis=0)
    
    # Empty dictionary to store all output
    output = {}

    for k, (img, label, x) in enumerate(zip(imgs, labels, X)):

        # Mini-batch of T copies of the same image
        img_batch = batch(img, T)

        # T stochastic forward passes on the same image
        probs = model.predict(img_batch, batch_size=T)

        # Gathering results
        probs_mean = np.mean(probs, axis=0)
        pred_std = np.std(probs, axis=0)
        prediction = probs_mean.argmax()
        uncertainty = pred_std[prediction]
        correct = 1 if prediction == label else 0

        # x is the unnormalized image, kept for later visualization
        output[k] = {"img": x, "softmax_dist": probs, "probs": probs_mean,
                     "prediction": prediction, "truth": label,
                     "uncertainty": uncertainty, "correct": correct}

    return output
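The conversion to pandas mentioned above can be sketched as follows (a sketch; the actual column construction lives in the GitHub code, and `output_to_dataframe` is a hypothetical helper that recovers prob1, prob2 and class2 from the averaged softmax vector):

```python
import numpy as np
import pandas as pd

def output_to_dataframe(output):
    """Flatten the per-image statistics stored by inference() into a
    pandas DataFrame, dropping the raw image and softmax samples."""
    rows = []
    for k, rec in output.items():
        rows.append({
            "prediction": rec["prediction"],
            "truth": rec["truth"],
            "uncertainty": rec["uncertainty"],
            "correct": rec["correct"],
            # prob1/prob2: mean softmax of the predicted and runner-up class
            "prob1": np.sort(rec["probs"])[-1],
            "prob2": np.sort(rec["probs"])[-2],
            "class2": np.argsort(rec["probs"])[-2],
        })
    return pd.DataFrame(rows)
```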

2.4 Models

We will examine data gathered from four variants of LeNet-5. All models were trained on CIFAR-10 using the fastai API. CIFAR-10 contains 60,000 labelled 32x32x3 colour images belonging to 10 different classes. The input data was split into a training set of 50,000 images and a test set of 10,000 images. The training set was further split into a training set and a validation set. All models have weight_decay = 0.0005 and all learning rates were chosen using lr_find:

  • model55: trained for 60 epochs with a learning rate of 0.001, kernel size = (5,5) and drop_rate = .5. Final validation loss 0.71280, validation accuracy 0.76137. model55 represents the baseline implementation of LeNet-5. It is identical in structure to the one used by Gal et al. [1].

  • model52: trained for 14 epochs with a learning rate of 0.01 for the first 7 epochs and 0.0001 for the remaining, lowered due to rapid overfitting. Kernel size = (5,5) and drop_rate = .2. Final validation loss 0.74186, validation accuracy 0.74406.

  • model35: trained for 66 epochs with a learning rate of 0.001, kernel size = (3,3) and drop_rate = .5. Final validation loss 0.75483, validation accuracy 0.74218.

  • model32: trained for 52 epochs with a learning rate of 0.001, kernel size = (3,3) and drop_rate = .2. Final validation loss 0.75449, validation accuracy 0.74090.

Note that model55 has a slightly better baseline performance than the other models.

2.5 Data

The data contains the following variables after it has been prepared for analysis in R:

  • correct (logical): indicator that is TRUE if the predicted class label matches the true class label, else FALSE.

  • prediction (int): predicted class label.

  • truth (int): true class label.

  • uncertainty (dbl): empirical standard deviation of softmax values for predicted class.

  • prob1 (dbl): maximum of the mean softmax output, i.e. the mean probability of the predicted class.

  • prob2 (dbl): mean probability of runner-up prediction.

  • class2 (int): class label of runner-up prediction.

  • logit_prob1 (dbl): logit transformation of prob1.

  • diff (dbl): prob1-prob2

  • diff_sd_ratio (dbl): diff/uncertainty.

All the variables above are pretty standard, with the exception of diff_sd_ratio. Intuitively, if diff is large, the averaged models all agree that class \(k\) is the correct prediction. If diff is small, the models sampled by MC dropout don’t agree on a single class. Thus diff also serves as a proxy for uncertainty. Model uncertainty, however, is approximated by the empirical standard deviation of the predictions for class \(k\). Thus diff_sd_ratio is expressed by \[\tau_{kj} = \frac{\hat{\mu}_k - \hat{\mu}_j}{\hat{\sigma}_k}\] where \(j\) is the runner-up prediction. \(\tau_{kj}\) gives us a ratio of two different measures of uncertainty.
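The engineered variables can be computed directly from the averaged outputs. The following is a sketch mirroring the definitions above (`derived_features` is an illustrative helper, not the notebook's actual code):

```python
import numpy as np

def derived_features(prob1, prob2, uncertainty):
    """Engineered variables from the list above.

    prob1: mean softmax output of the predicted class (mu_hat_k)
    prob2: mean softmax output of the runner-up (mu_hat_j)
    uncertainty: empirical standard deviation sigma_hat_k
    """
    prob1 = np.asarray(prob1, dtype=float)
    prob2 = np.asarray(prob2, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    logit_prob1 = np.log(prob1 / (1.0 - prob1))  # undefined when prob1 == 1
    diff = prob1 - prob2                          # numerator of tau_{kj}
    diff_sd_ratio = diff / uncertainty            # tau_{kj} = (mu_hat_k - mu_hat_j) / sigma_hat_k
    return logit_prob1, diff, diff_sd_ratio
```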

3 Exploratory analysis

This section will be divided into two parts. First, we will examine the resulting data from model55, which will be regarded as our baseline model. Next, we will analyse the data from models obtained by varying kernel sizes and dropout rates.

3.1 Uncertainty analysis of baseline model

3.1.1 Entire data set

# Importing data
data <- as.tibble(read.csv("~/Documents/Masteroppgave/Data/Resultater/results_sgd_df_20180404.csv"))
df <- select(data, -X)
# Summarizing entire data set
summary(df)
    correct         prediction        truth      uncertainty            prob1            prob2        
 Min.   :0.0000   Min.   :0.000   Min.   :0.0   Min.   :0.0000002   Min.   :0.1712   Min.   :0.00000  
 1st Qu.:1.0000   1st Qu.:2.000   1st Qu.:2.0   1st Qu.:0.0361875   1st Qu.:0.6200   1st Qu.:0.01172  
 Median :1.0000   Median :5.000   Median :4.5   Median :0.1370855   Median :0.8678   Median :0.08034  
 Mean   :0.8381   Mean   :4.538   Mean   :4.5   Mean   :0.1284096   Mean   :0.7887   Mean   :0.12263  
 3rd Qu.:1.0000   3rd Qu.:7.000   3rd Qu.:7.0   3rd Qu.:0.2044695   3rd Qu.:0.9819   3rd Qu.:0.21110  
 Max.   :1.0000   Max.   :9.000   Max.   :9.0   Max.   :0.3662840   Max.   :1.0000   Max.   :0.49695  
                                                                                                      
     class2           diff           diff_sd_ratio      logit_prob1     
 Min.   :0.000   Min.   :0.0001831   Min.   :      0   Min.   :-1.5769  
 1st Qu.:2.000   1st Qu.:0.3882025   1st Qu.:      2   1st Qu.: 0.4891  
 Median :3.000   Median :0.7804983   Median :      5   Median : 1.8820  
 Mean   :4.019   Mean   :0.6660705   Mean   :   1506   Mean   : 2.5044  
 3rd Qu.:6.000   3rd Qu.:0.9697620   3rd Qu.:     27   3rd Qu.: 3.9928  
 Max.   :9.000   Max.   :1.0000000   Max.   :4859390   Max.   :16.3562  
                                                       NA's   :3        

For the entire set of classifications, we have the following notable quantities:

  • The mean uncertainty of the predicted class is 0.1284096 and the median is 0.1370855. The minimum value is \(2.05787 \times 10^{-7}\), the maximum is 0.366284. The interquartile range (IQR) is 0.1682821.

  • The mean softmax output of the predicted class is 0.7886975 and the median is 0.8678453. The minimum value is 0.1712327, the maximum is 1. The IQR is 0.3618946.

  • The mean softmax output of the runner-up is 0.122627 and the median is 0.0803399. The minimum value is \(3.6108517 \times 10^{-8}\), the maximum is 0.4969542. The IQR is 0.1993783.

  • The mean difference between the softmax outputs of the prediction and runner-up is 0.6660705 and the median is 0.7804983. The minimum value is \(1.8307567 \times 10^{-4}\), the maximum is 1. The IQR is 0.5815595.

  • The mean difference to uncertainty ratio, or \(\tau_{kj}\), is 1506.1753085 and the median is 5.4525797. The minimum value is \(8.6489174 \times 10^{-4}\), the maximum is \(4.8593898 \times 10^{6}\). The IQR is 24.8902696.

3.1.2 Summary statistics

# Aggregating summary statistics by correct/incorrect
agg_df <- df %>% 
  group_by(correct) %>% 
  summarise(n=n(),
            mean_prob1=mean(prob1),
            mean_prob2=mean(prob2),
            mean_uncertainty=mean(uncertainty),
            median_uncertainty=median(uncertainty),
            sd_uncertainty=sd(uncertainty), 
            mean_diff=mean(diff),
            median_diff=median(diff),
            sd_diff=sd(diff),
            mean_ratio=mean(diff_sd_ratio),
            median_ratio=median(diff_sd_ratio),
            sd_ratio=sd(diff_sd_ratio))
agg_df
  • The model has incorrectly classified 1619 images.

  • The model has correctly classified 8381 images.

3.1.3 Incorrect classifications

# Summarizing incorrect predictions
incorrect_df <- df %>% 
  filter(correct==0)
summary(incorrect_df)
    correct    prediction        truth        uncertainty         prob1            prob2               class2     
 Min.   :0   Min.   :0.000   Min.   :0.000   Min.   :0.0056   Min.   :0.1712   Min.   :0.0003806   Min.   :0.000  
 1st Qu.:0   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.1757   1st Qu.:0.4053   1st Qu.:0.1850289   1st Qu.:2.000  
 Median :0   Median :4.000   Median :3.000   Median :0.2086   Median :0.5102   Median :0.2487094   Median :4.000  
 Mean   :0   Mean   :4.254   Mean   :4.022   Mean   :0.2077   Mean   :0.5321   Mean   :0.2532203   Mean   :4.053  
 3rd Qu.:0   3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:0.2413   3rd Qu.:0.6354   3rd Qu.:0.3245059   3rd Qu.:5.000  
 Max.   :0   Max.   :9.000   Max.   :9.000   Max.   :0.3663   Max.   :0.9991   Max.   :0.4925945   Max.   :9.000  
      diff           diff_sd_ratio        logit_prob1      
 Min.   :0.0001831   Min.   :  0.00086   Min.   :-1.57692  
 1st Qu.:0.0943908   1st Qu.:  0.42278   1st Qu.:-0.38356  
 Median :0.2113830   Median :  0.96571   Median : 0.04078  
 Mean   :0.2788916   Mean   :  1.83170   Mean   : 0.18488  
 3rd Qu.:0.4137291   3rd Qu.:  1.96640   3rd Qu.: 0.55535  
 Max.   :0.9986838   Max.   :178.33684   Max.   : 6.97348  

For the entire set of incorrect classifications, we have the following notable quantities:

  • The mean uncertainty of the predicted class is 0.2076574 and the median is 0.208626. The minimum value is 0.0056, the maximum is 0.366284. The interquartile range (IQR) is 0.065641.

  • The mean softmax output of the predicted class is 0.5321119 and the median is 0.510194. The minimum value is 0.1712327, the maximum is 0.9990644. The IQR is 0.230109.

  • The mean softmax output of the runner-up is 0.2532203 and the median is 0.2487094. The minimum value is \(3.8060214 \times 10^{-4}\), the maximum is 0.4925945. The IQR is 0.139477.

  • The mean difference between the softmax outputs of the prediction and runner-up is 0.2788916 and the median is 0.211383. The minimum value is \(1.8307567 \times 10^{-4}\), the maximum is 0.9986838. The IQR is 0.3193383.

  • The mean difference to uncertainty ratio, or \(\tau_{kj}\), is 1.8317041 and the median is 0.9657115. The minimum value is \(8.6489174 \times 10^{-4}\), the maximum is 178.3368412. The IQR is 1.5436196.

3.1.4 Correct classifications

# Summarizing correct predictions
correct_df <- df %>% 
  filter(correct==1)
summary(correct_df)
    correct    prediction        truth        uncertainty            prob1            prob2              class2     
 Min.   :1   Min.   :0.000   Min.   :0.000   Min.   :0.0000002   Min.   :0.2120   Min.   :0.000000   Min.   :0.000  
 1st Qu.:1   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.0241575   1st Qu.:0.7323   1st Qu.:0.007342   1st Qu.:2.000  
 Median :1   Median :5.000   Median :5.000   Median :0.1052500   Median :0.9171   Median :0.049465   Median :3.000  
 Mean   :1   Mean   :4.592   Mean   :4.592   Mean   :0.1131009   Mean   :0.8383   Mean   :0.097400   Mean   :4.012  
 3rd Qu.:1   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:0.1883090   3rd Qu.:0.9888   3rd Qu.:0.160019   3rd Qu.:6.000  
 Max.   :1   Max.   :9.000   Max.   :9.000   Max.   :0.3562930   Max.   :1.0000   Max.   :0.496954   Max.   :9.000  
                                                                                                                    
      diff           diff_sd_ratio      logit_prob1    
 Min.   :0.0003563   Min.   :      0   Min.   :-1.313  
 1st Qu.:0.5665170   1st Qu.:      3   1st Qu.: 1.005  
 Median :0.8653638   Median :      8   Median : 2.402  
 Mean   :0.7408638   Mean   :   1797   Mean   : 2.953  
 3rd Qu.:0.9809582   3rd Qu.:     40   3rd Qu.: 4.475  
 Max.   :1.0000000   Max.   :4859390   Max.   :16.356  
                                       NA's   :3       

For the entire set of correct classifications, we have the following notable quantities:

  • The mean uncertainty of the predicted class is 0.1131009 and the median is 0.10525. The minimum value is \(2.05787 \times 10^{-7}\), the maximum is 0.356293. The interquartile range (IQR) is 0.1641515.

  • The mean softmax output of the predicted class is 0.8382634 and the median is 0.917101. The minimum value is 0.2119797, the maximum is 1. The IQR is 0.2565594.

  • The mean softmax output of the runner-up is 0.0973997 and the median is 0.0494654. The minimum value is \(3.6108517 \times 10^{-8}\), the maximum is 0.4969542. The IQR is 0.1526768.

  • The mean difference between the softmax outputs of the prediction and runner-up is 0.7408638 and the median is 0.8653638. The minimum value is \(3.5625696 \times 10^{-4}\), the maximum is 1. The IQR is 0.4144412.

  • The mean difference to uncertainty ratio, or \(\tau_{kj}\), is 1796.7769426 and the median is 8.1020167. The minimum value is 0.0019334, the maximum is \(4.8593898 \times 10^{6}\). The IQR is 37.632868.

3.2 Distribution of uncertainty

In the following we will visualize the relationships between our variables. We start by examining the empirical distribution of the uncertainty estimates \(\hat{\sigma}_k\).

3.2.1 Full distribution

The distribution appears to be bimodal, with peaks close to 0 and 0.2:

# Distribution of estimated uncertainty
p1 <- df %>% 
  ggplot(aes(x=uncertainty)) +
  geom_histogram(col="grey", bins = 50, alpha=.5) +
  ggtitle("Distribution of estimated uncertainty")
p1

3.2.2 Uncertainty by prediction

By grouping the uncertainty estimates by correct (i.e. if the label was correctly predicted or not), we can find out how the predictions contribute to the uncertainty distribution.

The blue line corresponds to the correct predictions, the red line corresponds to incorrect predictions. We see that the incorrect predictions are centered around a higher associated uncertainty, whereas far more of the correctly predicted classes are concentrated around a low uncertainty value. The incorrect classifications greatly contribute to the bimodality, but it is also present in the distribution of uncertainty for the correct classifications.

#Distribution of estimated uncertainty by prediction
p2 <- df %>% 
  ggplot(aes(x=uncertainty, col=factor(correct))) +
  geom_freqpoly(alpha=.7) +
  ggtitle("Distribution of estimated uncertainty by classification") +
  scale_color_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct"))
p2

3.2.3 Kernel density estimates

The following kernel density estimate plot (using a Gaussian kernel) gives us an idea of how the distributions compare to each other:

# KDE by correct prediction
p3 <- df %>% 
  ggplot(aes(x=uncertainty)) +
  geom_density(data=incorrect_df, fill="red", alpha=I(.2)) +
  geom_density(data=correct_df, fill="turquoise", alpha=I(.2)) +
  ggtitle("Kernel density estimates of uncertainty distribution")
p3

3.2.4 Boxplots

The boxplot gives us yet another way to visualize the difference between the uncertainty distributions by predictive ability. It is interesting to note the number of outliers for the incorrect predictions, indicating some negative skew: these predictions are wrong, yet their associated uncertainty is low.

# Boxplot of uncertainties for correct vs. incorrect
p4 <- df %>% 
  ggplot(aes(x=factor(correct), y=uncertainty)) +
  geom_boxplot(aes(fill=factor(correct)), alpha=.7) +
  labs(x="correct") +
  ggtitle("Boxplot of uncertainty distribution by correct/incorrect") +
  scale_fill_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct"))
p4

3.3 Relationship to other variables

3.3.1 Softmax of predicted class

We may also be interested in the relationship between uncertainty and other variables. First, we plot uncertainty against prob1 (the prediction’s softmax output). The softmax outputs in the plot are colour graded: outputs close to 1 are red, outputs close to 0 are blue. The plot shows a clear parabolic shape:

# Plotting softmax output of prediction against estimated uncertainty
p5 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob1), alpha=.2) + # colour grading by softmax output
  ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
  labs(x="softmax output of predicted class") +
  scale_color_distiller(name="Softmax output",
                        palette = "Spectral")
p5

3.3.2 By classification

Red points indicate incorrect classifications, blue points indicate correct classifications.

p6 <- df %>% 
  mutate(correct=as.logical(correct)) %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(fill=correct, col=correct), shape=21, alpha=.5) +
  ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
  labs(x="softmax output of predicted class")
p6

3.3.3 Predictions and runners-up

We can obtain more information by colouring the points by the value of the runner-up predictions. This plot is particularly interesting. The softmax outputs of the runner-up classes \(\hat{\mu}_j\) are colour graded: \(\hat{\mu}_j \approx 0.5\) are red, \(\hat{\mu}_j \approx 0\) are blue. The round point marks the pair of mean values of (prob1, prob2) for the incorrect predictions. The triangular point marks the pair of mean values for the correct predictions.

We see a clear concentration of red points in the area where the probability of the predicted class \(\hat{\mu}_k \approx 0.5\). If the softmax outputs of both the predicted class and the runner-up are close to 0.5, then we have a situation analogous to maximum entropy. These points coincide with the largest approximated uncertainty values.

# Plotting softmax output of prediction against estimated uncertainty, coloured by softmax output of runner-up
p7 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2), shape=21, alpha=.7) +
  geom_point(data=agg_df, aes(x=mean_prob1, y=mean_prob2, shape=as.logical(correct)), size=2.5) +
  #geom_point(data=agg_df, aes(y=uncertainty, x=mean_prob1, shape=as.logical(correct)), size=2) +
  ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
  labs(x="softmax output of predicted class", shape="correct") +
  scale_color_distiller(name="runner up",
                        palette = "Spectral")
p7

3.3.4 Softmax of prediction by classification

In the following plot we split the observations by incorrect/correct predictions, and plot the values of uncertainty against the softmax output of the predicted class:

# Plotting uncertainty vs. softmax output of prediction
p8a <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2), shape=21, alpha=.7) +
  ggtitle("Uncertainty vs. softmax output of prediction by incorrect/correct") +
  labs(x="softmax output of prediction") +
  facet_grid(.~factor(correct)) +
  scale_color_distiller(name="Runner up",
                        palette = "Spectral")
p8a

3.3.5 Softmax of prediction by classification with contours

The black contour lines indicate where most of the points are concentrated. The plot on the left is for incorrect predictions. The right hand plot represents the correct predictions. For the correct predictions, it seems as if far more of the points are concentrated around high predicted output/low runner-up output/low uncertainty. This is not surprising, considering 75% of the correct predictions have a softmax value of approximately 0.7 or above. For the incorrect classifications, most of the points are concentrated around the area of maximum entropy. This indicates that the approximated uncertainty estimates indeed contain valuable information in the incorrect cases.

# Plotting uncertainty vs. softmax output of prediction
p8 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2), shape=21, alpha=.7) +
  geom_density_2d(col="black", alpha=.3) +
  ggtitle("Uncertainty vs. softmax output of prediction by incorrect/correct") +
  labs(x="softmax output of prediction") +
  facet_grid(.~factor(correct)) +
  scale_color_distiller(name="Runner up",
                        palette = "Spectral")
p8

3.3.6 Runner-up predictions

The following plot shows the relationship between uncertainty estimates and the softmax output of the runner-up prediction. Unsurprisingly, model uncertainty increases as the softmax output of the runner-up increases. We have plotted a LOESS estimate of the mean uncertainty as a function of the runner-up output to make this clearer:

# Plotting softmax output of runner-up against estimated uncertainty
p10 <- df %>% 
  ggplot(aes(x=prob2, y=uncertainty)) +
  geom_point(alpha=.2) +
  geom_smooth(method="loess") +
  labs(x="softmax output of runner-up") +
  ggtitle("Uncertainty vs. softmax output of runner-up")
p10

3.4 Images associated with high uncertainty

The following plots were generated using Python (code available on GitHub). On the left hand side we see the unnormalized image with the corresponding ground truth label. The plot in the middle shows the softmax output of the predicted class for each of the \(T=100\) stochastic forward passes. \(\mu_k\) is given by the solid red line, \(\mu_j\) is given by the dashed brown line. The plot title shows both the predicted class and the runner-up class. To the right is a kernel density estimate (using a Gaussian kernel) of the \(T=100\) softmax outputs for the predicted class. The plots show the top 5 most uncertain classifications in the entire data set.

NOTE: Add new images.

3.4.1 Incorrect

3.4.2 Correct

3.5 Uncertainty-prediction correlation

As mentioned in section 2.2, the connection established between dropout neural networks and GPs is lost when applied to convolutional neural networks. Performing a logistic regression gives us a simple way of testing whether the approximated uncertainty is a significant predictor of the model’s ability to predict correctly.

The coefficient associated with uncertainty is highly significant, indicating that the estimates gathered from performing MC dropout are indeed a useful quantification of predictive uncertainty. Given a unit increase in uncertainty, we expect the probability of predicting correctly to decrease.

# Fitting logistic regression model to check significance of uncertainty
model_sd <- glm(factor(correct)~uncertainty, data=df, family = binomial(link="logit"))
summary(model_sd)

Call:
glm(formula = factor(correct) ~ uncertainty, family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8062   0.1968   0.3105   0.6347   1.6688  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.99779    0.08754   45.67   <2e-16 ***
uncertainty -14.32714    0.42711  -33.55   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8856.1  on 9999  degrees of freedom
Residual deviance: 7234.4  on 9998  degrees of freedom
AIC: 7238.4

Number of Fisher Scoring iterations: 5
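As a quick sanity check, we can plug the fitted coefficients into the inverse logit to see how the predicted probability of a correct classification falls with uncertainty (a Python sketch, not part of the fitted model output):

```python
import math

# Coefficients from the glm() summary above
intercept = 3.99779
beta_uncertainty = -14.32714

def p_correct(uncertainty):
    """Predicted probability of a correct classification (inverse logit)."""
    logit = intercept + beta_uncertainty * uncertainty
    return 1 / (1 + math.exp(-logit))

print(round(p_correct(0.113), 3))  # at the mean uncertainty of correct predictions, ~0.915
print(round(p_correct(0.208), 3))  # at the mean uncertainty of incorrect predictions, ~0.735
```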

3.6 Referral criteria

The question is then: How do we determine a reasonable uncertainty value for referral to a human expert? As we saw in section 3.1.2, the mean uncertainty of the incorrect predictions is higher than that of the correct predictions (0.2076574 vs. 0.1131009, respectively).

Naively, we could set the threshold for referral to the mean uncertainty of the incorrectly classified images:

# Counting number of correct/incorrect by uncertainty >= .207 
referral_mean <- df %>% 
  filter(uncertainty>=.207) %>% 
  count(correct)
referral_mean

The problem here is apparent: Many of the correctly classified images are associated with a relatively high level of uncertainty (in our case, this is due to the bimodality of the uncertainty distribution in section 3.2.1). The proportion of incorrectly classified images in the referred sample is 0.346845.

The result of our logistic regression suggests that uncertainty is a significant predictor of the model’s ability to predict correctly. However, although uncertainty seems to contain valuable information, the relative uncertainties of the correct/incorrect observations (in this case) may be too small to differentiate which images should be referred to an expert. We may need to amplify the quantification of uncertainty in some way.

3.6.1 Usefulness of runner-up predictions

For the most uncertain incorrectly classified images in section 3.4.1, the runner-up suggestions are in fact the ground truth labels 80% of the time. This may be due to chance, but it raises the question: Is there any information to be obtained from the runner-up predictions? What would happen to our overall accuracy if we used the runner-up predictions for all incorrect classifications?

# Counting classification accuracy if runner-up is equal to ground truth
class2_df <- df %>%
  mutate(correct=replace(correct, correct==0 & class2==truth, 1)) %>% 
  count(correct)
class2_df

Accuracy would rise to 0.9393. This indicates that there may be some valuable information to be gathered from the runner-up predictions.
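The counting trick used above can be illustrated on a toy example in Python (the labels below are hypothetical, not the thesis data):

```python
import pandas as pd

# Toy predictions (hypothetical labels): correct flips to 1 whenever the
# runner-up class (class2) matches the ground truth
df_toy = pd.DataFrame({
    "correct": [1, 0, 0, 1, 0],
    "class2":  [3, 7, 2, 0, 5],
    "truth":   [1, 7, 4, 0, 5],
})

# Mirror the dplyr mutate/replace above: incorrect predictions whose
# runner-up equals the truth are recounted as correct
rescued = (df_toy["correct"] == 0) & (df_toy["class2"] == df_toy["truth"])
df_toy.loc[rescued, "correct"] = 1

print(df_toy["correct"].sum())  # 4 of the 5 toy predictions now count as correct
```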

3.6.2 A revised quantification of uncertainty

One possible approach is to use the value of \(\tau_{jk}\) introduced in section 2.5. Recall that \(\tau_{jk}\) incorporates information about the runner-up prediction into our quantification of uncertainty. Consider the following cases:

  • \(\tau_{jk} \approx 1\) means that diff and uncertainty are relatively similar in magnitude. This happens if a) the models have failed to reach a consensus (diff is small) but model uncertainty is low, or b) the models have reached a consensus (diff is large) but model uncertainty is high. Let’s call these referral predictions.

  • \(\tau_{jk} \to 0\) means that uncertainty is much larger than diff. These should represent uncertain predictions.

  • \(\tau_{jk} \to \infty\) means that uncertainty is much smaller than diff. These should represent non-referral predictions.
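The three regimes above can be expressed as a simple decision rule. The following Python sketch uses illustrative cut-offs (`lo` and `hi` are placeholders, not values from the thesis):

```python
def referral_status(diff, uncertainty, lo=0.5, hi=2.0):
    """Classify a prediction by tau = diff / uncertainty.

    tau << 1 -> uncertainty dominates:  'uncertain'
    tau ~  1 -> mixed signals:          'referral'
    tau >> 1 -> consensus dominates:    'non-referral'
    The cut-offs lo and hi are illustrative placeholders.
    """
    tau = diff / uncertainty
    if tau < lo:
        return "uncertain"
    if tau > hi:
        return "non-referral"
    return "referral"

print(referral_status(diff=0.05, uncertainty=0.30))  # tau ~ 0.17 -> uncertain
print(referral_status(diff=0.20, uncertainty=0.21))  # tau ~ 0.95 -> referral
print(referral_status(diff=0.90, uncertainty=0.05))  # tau = 18 -> non-referral
```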

As we saw in table 3.1.2, the mean \(\tau_{jk}\) for the incorrectly classified images is 1.8317041 and 1796.7769426 for the correctly classified images. The respective medians are 0.9657115 and 8.1020167.

As a first step, setting the referral threshold to \(\tau_{jk} \leq 1.83\) (the mean \(\tau_{jk}\) of the incorrect classifications) gives:

# Counting number of correct/incorrect by tau <= 1.83
referral_meantau <- df %>% 
  filter(diff_sd_ratio <= 1.83) %>% 
  count(correct)
referral_meantau

We run into the same problem as when we used the mean uncertainty: Many of the correctly predicted images would be referred. It is interesting to note, however, that we get more referrals of incorrectly classified images. In this case, the proportion of incorrectly classified images in the referred sample is 0.4581071.

Upon inspection, the boxplot of the log-transformed \(\tau_{jk}\) clearly shows the presence of extreme values (outliers marked in red) in both tails. For incorrect predictions, the distribution seems to be negatively skewed. The distribution of the correct predictions appears to be positively skewed.

pt <- df %>% 
  ggplot(aes(x=factor(correct), y=log(diff_sd_ratio))) +
  geom_boxplot(aes(fill=factor(correct)), outlier.colour = "red", outlier.alpha = .15) +
  xlab("correct") +
  ylab("log-transformed tau")  +
  ggtitle("Presence of extreme uncertainty values") +
  labs(fill="correct")
pt

Perhaps a more robust measure of the central tendency of \(\tau_{jk}\) is the median. Setting the referral threshold to \(\tau_{jk} \leq 0.965\) (the median of the incorrect classifications) gives the following:

# Counting number of correct/incorrect by tau <= .965
referral_medtau <- df %>% 
  filter(diff_sd_ratio <= .965) %>% 
  count(correct)
referral_medtau

This is a clear improvement over the mean approach in terms of the number of referred images: 373 fewer incorrect images are referred, compared to 673 fewer correct images. In this case, the proportion of incorrectly classified images in the referred sample is 0.5274151.

When compared to using the mean of \(\hat{\sigma}_k\) as a cut-off for referral, the median of \(\tau_{jk}\) results in almost 20% more referrals of incorrect images relative to the sample size. This could indicate that \(\tau_{jk}\) is a more sensitive uncertainty quantification for practical applications.

3.7 Comparison of tau and uncertainty

We can examine this further by plotting the number of correct/incorrect referrals against different values for \(\tau_{jk}\) and \(\sigma_k\).

3.7.1 Data preparation

First we need to prepare the data for plotting. In the following chunk we count the number of referrals of correctly/incorrectly classified images for increasing values of \(\tau_{jk}\) and \(\sigma_k\):

  • For \(\tau_{jk}\), we set the initial value \(\tau_1 = 0.05\) and increment by 0.1 until we reach a maximum value of 10. This cut-off value was chosen because it gives us roughly the same proportion of correct vs. incorrect classifications as our threshold for \(\sigma_k\). For every \(\tau_{jk} \leq \tau_i\) we count the number of correct and incorrect classifications among the referred images. Finally, we calculate the relative proportions of correctly/incorrectly classified images per value.

  • For \(\sigma_k\), we set the initial value to \(\sigma_1=0.01\) and increment by 0.01 until we reach a maximum value of 0.35. This cut-off value was chosen because it gives us roughly the same proportion of correct vs. incorrect classifications as our threshold for \(\tau_{jk}\). For every \(\sigma_{k} \geq \sigma_i\) (i.e. all images that have higher uncertainty than this value) we count the number of correct and incorrect classifications among the referred images.

# Preparing data
count_df <- df %>% 
  rename(tau=diff_sd_ratio) %>% 
  select(correct, uncertainty, tau)
# Gathering number of referrals by tau
roc_tau <- data.frame(n_false=numeric(), n_true=numeric(), tau_value=numeric())
idx=1
tau_seq <- seq(0.05, 10, by=.1)
for(value in tau_seq){
  get_results <- count_df %>%
    filter(tau<=value) %>% 
    count(correct)
  roc_tau[idx,] <- c(get_results$n, value)
  idx <- idx + 1
}
plot_tau <- roc_tau %>% 
  gather(status, n, -tau_value) %>% 
  group_by(tau_value) %>% 
  mutate(p=n/sum(n)) %>% 
  ungroup()
# Gathering number of referrals by uncertainty
roc_unc <- data.frame(n_false=numeric(), n_true=numeric(), uncertainty_value=numeric())
idx=1
uncertainty_seq <- seq(0.01, .35, by=.01)
for(value in uncertainty_seq){
  get_results <- count_df %>%
    filter(uncertainty>=value) %>% 
    count(correct)
  roc_unc[idx,] <- c(get_results$n, value)
  idx <- idx + 1
}
plot_unc <- roc_unc %>% 
  gather(status, n, -uncertainty_value) %>% 
  group_by(uncertainty_value) %>% 
  mutate(p=n/sum(n)) %>% 
  ungroup()

3.7.2 Plotting

The first row of the plot contains the referrals for varying values of \(\sigma_k\). The second row shows referral counts for varying values of \(\tau_{jk}\). The first column shows the absolute counts (note the different x-axes) and the second column shows the relative referral rates. The dashed vertical lines represent the mean of \(\sigma_k\) and the median of \(\tau_{jk}\), respectively, for the incorrectly classified images.

# Absolute counts
p1a <- ggplot(plot_unc, aes(x=uncertainty_value, y=n, group=status)) +
  geom_line(aes(col=status)) +
  geom_vline(xintercept=agg_df$mean_uncertainty[1], lwd=.3, lty="dashed") +
  labs(x="uncertainty", y="number of referrals") +
  ggtitle("Absolute referral counts") +
  theme(legend.position = "none")
p2a <- ggplot(plot_tau, aes(x=tau_value, y=n, group=status)) +
  geom_line(aes(col=status)) +
  geom_vline(xintercept=agg_df$median_ratio[1], lwd=.3, lty="dashed") +
  labs(x="tau", y="number of referrals") +
  ggtitle("Absolute referral counts")+
  theme(legend.position = "none")
# Relative counts
p1b <- ggplot(plot_unc, aes(x=uncertainty_value, y=p, group=status)) +
  geom_line(aes(col=status)) +
  labs(x="uncertainty", y="proportion of referrals") +
  geom_vline(xintercept=agg_df$mean_uncertainty[1], lwd=.3, lty="dashed") +
  ggtitle("Relative referral rates")+
  theme(legend.position = "none")
p2b <- ggplot(plot_tau, aes(x=tau_value, y=p, group=status)) +
  geom_line(aes(col=status)) +
  labs(x="tau", y="proportion of referrals") +
  geom_vline(xintercept=agg_df$median_ratio[1], lwd=.3, lty="dashed") +
  ggtitle("Relative referral rates")+
  theme(legend.position = "none")
layout <- matrix(c(1,2,3,4), nrow = 2, ncol = 2, byrow=TRUE)
multiplot(p1a, p1b, p2a, p2b, layout = layout)

3.7.3 Tau vs. uncertainty

Plotting the absolute referral counts for varying values of \(\tau_{jk}\) (blue line) and \(\sigma_k\) (red line) in the same plot tells us that \(\tau_{jk}\) gives us more “bang for the buck”, so to speak.

p3a <- ggplot(roc_tau, aes(x=n_true, y=n_false)) +
  geom_line(col="blue") +
  geom_line(data=roc_unc, aes(x=n_true, y=n_false), col="red") +
  ylim(c(0,2000)) +
  xlim(c(0,8000))
p3a

4 Digit classification on MNIST

Draft: Checking to see if \(\tau_{jk}\) is useful on a completely different data set (MNIST).

# Importing MNIST data
data_mnist <- as.tibble(read.csv("~/Documents/Masteroppgave/Data/Resultater/mnist_df.csv"))
df_mnist <- select(data_mnist, -X)
# Summarizing entire data set
summary(df_mnist)
    correct         prediction        truth        uncertainty            prob1            prob2          
 Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000011   Min.   :0.2871   Min.   :0.0000002  
 1st Qu.:1.0000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.0004957   1st Qu.:0.9928   1st Qu.:0.0001036  
 Median :1.0000   Median :4.000   Median :4.000   Median :0.0024089   Median :0.9991   Median :0.0005784  
 Mean   :0.9864   Mean   :4.433   Mean   :4.443   Mean   :0.0265448   Mean   :0.9747   Mean   :0.0186610  
 3rd Qu.:1.0000   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:0.0166062   3rd Qu.:0.9998   3rd Qu.:0.0047357  
 Max.   :1.0000   Max.   :9.000   Max.   :9.000   Max.   :0.3386140   Max.   :1.0000   Max.   :0.4948297  
     class2           diff           diff_sd_ratio       logit_prob1     
 Min.   :0.000   Min.   :0.0002854   Min.   :     0.0   Min.   :-0.9096  
 1st Qu.:2.000   1st Qu.:0.9881017   1st Qu.:    59.2   1st Qu.: 4.9280  
 Median :5.000   Median :0.9984862   Median :   414.4   Median : 6.9962  
 Mean   :5.119   Mean   :0.9560544   Mean   :  4602.0   Mean   : 6.7159  
 3rd Qu.:7.000   3rd Qu.:0.9997167   3rd Qu.:  2016.7   3rd Qu.: 8.6554  
 Max.   :9.000   Max.   :0.9999995   Max.   :899714.7   Max.   :15.4349  
# Aggregating summary statistics by correct/incorrect
agg_df_mnist <- df_mnist %>% 
  group_by(correct) %>% 
  summarise(n=n(),
            mean_prob1=mean(prob1),
            mean_prob2=mean(prob2),
            mean_uncertainty=mean(uncertainty),
            median_uncertainty=median(uncertainty),
            sd_uncertainty=sd(uncertainty), 
            mean_diff=mean(diff),
            median_diff=median(diff),
            sd_diff=sd(diff),
            mean_tau=mean(diff_sd_ratio),
            median_tau=median(diff_sd_ratio),
            sd_ratio=sd(diff_sd_ratio))
agg_df_mnist
# Finding cases where runner up equals true class
medtau <- df_mnist %>% 
  filter(class2 == truth) %>% 
  summarise(n = n(),
            mean_uncertainty = mean(uncertainty),
            median_uncertainty = median(uncertainty),
            mean_tau = mean(diff_sd_ratio),
            median_tau = median(diff_sd_ratio))
medtau
# Setting referral threshold to the mean uncertainty of incorrect predictions
referral_mnist_sigma <- df_mnist %>% 
  filter(uncertainty >= .22) %>% 
  count(correct)
referral_mnist_sigma

In this case, the proportion of incorrectly classified images in the referred sample is 0.2666667.

# Counting number of correct/incorrect by tau <= 1.36
referral_mnist_tau <- df_mnist %>% 
  filter(diff_sd_ratio <= 1.36) %>% 
  count(correct)
referral_mnist_tau

In this case, the proportion of incorrectly classified images in the referred sample is 0.3941176. Again, preliminary results suggest that runner-up predictions can be leveraged to produce sensible cut-off values for referral.

---
title: "Analysis of approximated uncertainty in deep learning"
author: "Sean Meling Murray"
output:
  html_document:
    df_print: paged
    toc: yes
    code_folding: hide
  html_notebook:
    number_sections: yes
    toc: yes
    code_folding: hide
---

***
# Setup
In this notebook we use the following R packages: `tidyverse` and `ggplot2`. In addition we use the `multiplot` function provided by [Cookbook for R](http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/). In Python we use `fastai`, `numpy` and `pandas`. Full code for training and inference is [available on GitHub](https://github.com/smu095/smu095.github.io/blob/master/dat259.ipynb). All code for statistical analysis is included in this document. By default all code chunks are hidden; they can be expanded by clicking the `Code` button.

```{r echo=FALSE, results='hide'}
# Loading useful packages
library(tidyverse)
library(ggplot2)
library(Hmisc)
library(knitr)
#library(kableExtra)

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

# Data cleaning function
process <- function(ROOT, filename, name, kernel_sz, drop_rate){
  ROOT <- "~/Documents/Masteroppgave/Data/Resultater/"
  PATH <- paste(ROOT, filename, sep="")
  dat <- as.tibble(read.csv(as.character(PATH)))
  #dat <- select(dat, -X)
  dat <- dat %>% 
    mutate(model=name,
           kernel=kernel_sz,
           dropout=drop_rate)
  return(dat)
}
```

# Introduction

The success of deep learning models has typically been measured in terms of their predictive power, but they have lacked a principled way of expressing uncertainty about these predictions.

In my master's thesis we apply recent insights connecting dropout neural networks to [Bayesian approximate variational inference (VI)](https://arxiv.org/abs/1601.00670). VI is a technique for approximating intractable integrals that arise when modelling the posterior distribution in Bayesian inference and machine learning. Gal et al. [1, 2, 3] have shown that the posterior distribution of a NN can be approximated by leaving dropout on at test time and sampling multiple predictions. This amounts to drawing Monte Carlo (MC) samples from the posterior (a brief description of this process is given in section 1.1). The Bayesian framework allows us to obtain principled uncertainty estimates when making predictions in these so-called *MC dropout* NNs.

The results of [1,2,3] seem particularly promising when applied to healthcare. In fact, a recent paper published in Nature [4] applies MC dropout to capture uncertainty estimates when predicting the presence of diabetic retinopathy in patients. This paper demonstrates how uncertainty estimates can provide useful information to the clinician tasked with interpreting the results of medical images. The key idea is that this human-machine interaction will lead to overall better results than either could produce individually.

In this notebook I will first briefly introduce the general idea behind MC dropout. Then we will apply MC dropout in a classification task based on the collection of labelled images in the [CIFAR-10 data set](https://www.cs.toronto.edu/~kriz/cifar.html).

## Background

A neural network is made up of many neurons which are organized in layers. Neurons are often called nodes or units, depending on your choice of machine learning literature. Henceforth we will refer to them as units. 

In a typical neural network there are many, many units. As a consequence the number of parameters greatly exceeds the number of data points, dramatically increasing the risk of overfitting (i.e. there are many settings of weights which will cause the network to fit the data almost perfectly).

Dropout is a common stochastic regularisation technique [5] that is used to prevent overfitting in neural networks. The term dropout refers to randomly "dropping out" non-output units, temporarily removing all of their connections to the rest of the network. The main idea is that if the presence of other units is unreliable, each unit is forced to learn how to be useful on its own. At test time all units are activated, and the weights of the network are scaled by the probability of retaining a unit in order to match the expected output during training.
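The train/test behaviour described above can be sketched in a few lines of numpy. This is a pedagogical sketch of standard dropout (not the thesis code); with `p = 0.5`, the scaled test-time output matches the expected training-time output:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, p=0.5):
    """Training time: randomly zero each unit with probability p."""
    mask = rng.random(x.shape) >= p
    return x * mask

def dropout_test(x, p=0.5):
    """Test time: keep every unit, scale activations by the keep
    probability (1 - p) to match the training-time expectation."""
    return x * (1 - p)

x = np.ones(10_000)
print(dropout_train(x).mean())  # close to 0.5
print(dropout_test(x).mean())   # exactly 0.5
```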

Recent work by Gal et al. [1, 2, 3] casts dropout neural networks as approximate Bayesian inference. Their results show that the predictive posterior distribution of a neural network can be approximated by leaving dropout on at test time.

Consider a classification setting, such as in CIFAR-10. Essentially, what happens when applying MC dropout is the following:

* An image is fed forward through the network $T$ times (we use $T=100$ as recommended in [4]). Each time the image is fed through is called a *stochastic forward pass*.
* For each stochastic forward pass, a slightly different network is making predictions because dropout has randomly switched off units. 
* As a result, the $T$ stochastic forward passes yield 100 slightly different vectors of class predictions.
* To make a prediction we average the 100 vectors. The class corresponding to the largest element in the resulting vector is our final prediction.
* Finally we calculate the standard deviation of the predicted class's softmax outputs over all forward passes. This is our estimated uncertainty.

Mathematically, model uncertainty is approximated by the empirical standard deviation of the softmax outputs for the predicted class $k$, i.e. $$\hat{\sigma}_k = \sqrt{\frac{\sum_{t=1}^T{[p_t(k|X,w) - \hat{\mu}_k}]^2}{T-1}}$$ where $$\hat{\mu}_k = \frac{1}{T}\sum_{t=1}^T p_t(k|X,w)$$ is the average of the softmax outputs for the predicted class.
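The sampling procedure and the estimators $\hat{\mu}_k$ and $\hat{\sigma}_k$ can be sketched with numpy. The `stochastic_forward` function below is a hypothetical stand-in for one forward pass of the network with dropout active:

```python
import numpy as np

rng = np.random.default_rng(42)
T, n_classes = 100, 10

def stochastic_forward():
    """Hypothetical stand-in for one forward pass with dropout active:
    fixed logits favouring class 0, perturbed by noise."""
    logits = np.array([2.0, 1.0] + [0.0] * (n_classes - 2)) + rng.normal(0, 0.5, n_classes)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# T stochastic forward passes -> T slightly different softmax vectors
probs = np.stack([stochastic_forward() for _ in range(T)])

mu = probs.mean(axis=0)            # averaged softmax outputs, mu_hat
k = int(mu.argmax())               # final prediction
sigma_k = probs[:, k].std(ddof=1)  # estimated uncertainty, sigma_hat_k

print(k, round(float(mu[k]), 2), round(float(sigma_k), 2))
```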

Gal et al. [3] show that the above amounts to drawing Monte Carlo samples from the predictive posterior. Their work demonstrates that applying dropout is effectively the same as defining a *Bayesian neural network* with a Bernoulli approximating prior over the parameters. [Gal has written an excellent blog post that introduces the work](http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html) and the derivation.

## Problem statement

The derivation in Gal et al. [2] is based on the observation that a shallow neural network with infinitely many weights converges to a Gaussian process (GP) with a specific covariance function. A GP is a non-parametric, probabilistic machine learning method. The idea is that we place a prior distribution over the function space, and by observing new data points we can figure out which function is most likely to have generated the observed data. A GP is fully specified by its mean and covariance functions, and allows us to obtain uncertainty estimates [1]. The extension of this to dropout neural networks is the main result of Gal et al. [2].

In [1], however, Gal et al. extend this idea further to include priors over the weights of the convolutional layers in a convolutional neural network (CNN). A CNN is a specialized type of neural network, used primarily and very effectively for image analysis. The authors briefly state that the GP interpretation is lost when the model is extended to CNNs, but that these networks can still be "modelled as Bayesian".

As far as we are aware, there have been no efforts to determine the correlation between the approximated uncertainty and a network's ability to predict correctly. Moreover, the work done so far seems to focus on the uncertainty associated with the predicted class. We will examine if there is any additional information to be gained from establishing a connection between the prediction and the *runner-up* prediction. In other words, our problem statement is:

> *Are the uncertainty approximations obtained by applying Monte Carlo dropout to convolutional neural networks a reasonable measure of model uncertainty?*

## Experimental setup{.tabset}

State-of-the-art architectures such as ResNet and DenseNet are very powerful, but they are also complicated and their inner workings are quite convoluted. We are primarily interested in examining the correlation between uncertainty estimation and predictive capabilities. It is arguably better to use a simple network architecture to illustrate the idea of MC dropout. In our approach we use LeNet-5. Click on the tabs for implementation details. Full code is available on [GitHub](https://github.com/smu095/smu095.github.io/blob/master/dat259.ipynb).

### Model

LeNet-5 was a pioneering 7-layer convolutional neural network originally developed by Yann LeCun in 1998 for handwritten digit recognition. It is hopelessly primitive compared to contemporary architectures, but it still captures the gist of what a convolutional network is while remaining simple enough to allow us to understand every building block of the network. The following chunk shows the model architecture. Toggle the `Code` button to the right to view.

```{python python.reticulate=FALSE, eval=FALSE}
# Building simple Bayesian CNN with MC dropout and SGD optimizer
# Keras imports (DropoutMC is the custom MC dropout layer defined in the next tab)
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import SGD
def lenet_all(input_shape, nb_classes, p=0.5):

    # Building model
    model=Sequential([
        Conv2D(input_shape=input_shape, filters=192, kernel_size=(5,5)), # 2D convolution layer, e.g. spatial convolution over images.
        DropoutMC(p),
        MaxPooling2D(strides=2), # Max pooling operation for spatial data
                
        Conv2D(192, kernel_size=(5,5)), # 2D convolution layer, e.g. spatial convolution over images.
        DropoutMC(p),
        MaxPooling2D(strides=2), # Max pooling operation for spatial data
                    
        Flatten(), # Flattens the input
        Dense(1000, activation="relu"), # Fully connected layer
        DropoutMC(p),
        Dense(nb_classes, activation="softmax") # Output layer
    ]) 
    
    # Compiling model
    model.compile(SGD(momentum=0.9, decay=0.0005), # Optimiser
                  loss="categorical_crossentropy", # Minimisation objective
                  metrics=["accuracy"]) # Evaluation metric
    
    return model
```
<br>

### MC dropout layer

Our specification of LeNet-5 differs from the original in one crucial way: We use Monte Carlo dropout layers. MC dropout is not available as a built-in layer, so we must implement it ourselves. Fortunately, this amounts to a simple adjustment of existing code: We subclass the standard `Dropout` layer as `DropoutMC`, whose `call` method has its `training` argument default to `True`, so that dropout stays active at test time. Toggle the `Code` button to the right to view.

```{python python.reticulate=FALSE, eval=FALSE}
class DropoutMC(Layer):
    """Applies MC Dropout to the input.
    Dropout consists in randomly setting
    a fraction `rate` of input units to 0 at each update during training time,
    which helps prevent overfitting.
    MC Dropout consists in sampling from the posterior predictive distribution by activating dropout at test time.
    # Arguments
        rate: float between 0 and 1. Fraction of the input units to drop.
        noise_shape: 1D integer tensor representing the shape of the
            binary dropout mask that will be multiplied with the input.
            For instance, if your inputs have shape
            `(batch_size, timesteps, features)` and
            you want the dropout mask to be the same for all timesteps,
            you can use `noise_shape=(batch_size, 1, features)`.
        seed: A Python integer to use as random seed.
    # References
        - [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf)
    """
    @interfaces.legacy_dropout_support
    def __init__(self, rate, noise_shape=None, seed=None, **kwargs):
        super(DropoutMC, self).__init__(**kwargs)
        self.rate = min(1., max(0., rate))
        self.noise_shape = noise_shape
        self.seed = seed
        self.supports_masking = True

    def _get_noise_shape(self, inputs):
        if self.noise_shape is None:
            return self.noise_shape

        symbolic_shape = K.shape(inputs)
        noise_shape = [symbolic_shape[axis] if shape is None else shape
                       for axis, shape in enumerate(self.noise_shape)]
        return tuple(noise_shape)

    def call(self, inputs, training=True):
        if 0. < self.rate < 1.:
            noise_shape = self._get_noise_shape(inputs)

            def dropped_inputs():
                return K.dropout(inputs, self.rate, noise_shape,
                                 seed=self.seed)
            return K.in_train_phase(dropped_inputs, inputs,
                                    training=training)
        return inputs

    def get_config(self):
        config = {'rate': self.rate,
                  'noise_shape': self.noise_shape,
                  'seed': self.seed}
        base_config = super(DropoutMC, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def compute_output_shape(self, input_shape):
        return input_shape
```
<br>

### Inference

We need to define a function that performs inference over our input. The following chunk contains the relevant code for the sampling procedure described in section 1.1. The function `inference` stores all the relevant statistics and softmax distributions in a dictionary named `output`. The results are then turned into a pandas dataframe and some basic feature engineering is performed. Finally, the data is prepared for statistical analysis in R. Toggle the `Code` button to the right to view.

```{python python.reticulate=FALSE, eval=FALSE}
def batch(img, T=100):
    ''' Create a mini-batch of T identical images for use in inference.

        Arguments:
        img, numpy array
        T, number of stochastic forward passes (i.e. number of times the image is repeated)
        '''
    return np.array([img] * T)

def inference(model, X, y, T=100, normalize=False):
    ''' Perform inference by applying MC dropout with T stochastic forward passes on the given input.

        Arguments:
        model, a model object
        X, images to be classified
        y, corresponding labels
        T, number of stochastic forward passes
        normalize, whether to standardize the input images
        '''
    # Get images, labels
    imgs, labels = X, y.argmax(axis=1)

    if normalize:
        # Preprocess input
        imgs = imgs.astype("float32")
        imgs -= np.mean(X)
        imgs /= np.std(X, axis=0)

    # Dictionary to store all output, keyed by image index
    output = {}

    for k, (img, label, x) in enumerate(zip(imgs, labels, X)):
        # Mini-batch of T copies of the same image
        img_batch = batch(img, T)

        # T stochastic predictions on the same image
        probs = model.predict(img_batch, batch_size=T)

        # Gather results
        probs_mean = np.mean(probs, axis=0)
        pred_std = np.std(probs, axis=0)
        prediction = probs_mean.argmax()
        uncertainty = pred_std[prediction]
        correct = 1 if prediction == label else 0

        output[k] = {"img": x,                # unnormalized image
                     "softmax_dist": probs,   # all T softmax outputs
                     "probs": probs_mean,     # mean softmax output
                     "prediction": prediction,
                     "truth": label,
                     "uncertainty": uncertainty,
                     "correct": correct}

    return output
```
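
To make the sampling statistics concrete, the following sketch simulates the core of the procedure on synthetic data: $T$ softmax vectors for a single image are averaged, and the empirical standard deviation of the predicted class serves as the uncertainty estimate. The Dirichlet draws are a stand-in for real stochastic forward passes, and class 3 is an arbitrary "true" class; none of this comes from the trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_classes = 100, 10

# Synthetic stand-in for T stochastic forward passes: T softmax vectors
# drawn from a Dirichlet that favours class 3 (an arbitrary "true" class)
alpha = np.ones(n_classes)
alpha[3] = 20.0
probs = rng.dirichlet(alpha, size=T)      # shape (T, 10), rows sum to 1

probs_mean = probs.mean(axis=0)           # averaged softmax output
pred_std = probs.std(axis=0)              # per-class spread across passes
prediction = probs_mean.argmax()          # predicted class label
uncertainty = pred_std[prediction]        # sigma_k of the predicted class

print(prediction, round(uncertainty, 3))
```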

## Models

We will examine data gathered from four variants of LeNet-5. All models were trained on CIFAR-10 using the `fastai` API. CIFAR-10 contains 60,000 labelled 32x32x3 colour images belonging to 10 different classes. The input data was split into a training set of 50,000 images and a test set of 10,000 images. The training set was further split into a training set and a validation set. All models have `weight_decay = 0.0005` and all learning rates were chosen using `lr_find`:

* `model55`: Trained for 60 epochs with a learning rate of 0.001 and `kernel size = (5,5)` and `drop_rate = .5`.  **0.71280 validation loss** at end of training with an **accuracy of 0.76137** on the validation data. `model55` represents the baseline implementation of LeNet-5. It is identical in structure to the one used by Gal et al. [1].

* `model52`: Trained for 14 epochs with a learning rate of 0.01 for the first 7 epochs and 0.0001 for the remaining epochs; the schedule was changed due to rapid overfitting. The model has `kernel size = (5,5)` and `drop_rate = .2`. **0.74186 validation loss** at end of training with an **accuracy of 0.74406** on the validation data.

* `model35`: Trained for 66 epochs with a learning rate of 0.001 and `kernel size = (3,3)` and `drop_rate = .5`. **0.75483 validation loss** at end of training with an **accuracy of 0.74218** on the validation data. 

* `model32`: Trained for 52 epochs with a learning rate of 0.001 and `kernel size = (3,3)` and `drop_rate = .2`. **0.75449 validation loss** at end of training with an **accuracy of 0.74090** on the validation data.

Note that `model55` has a slightly better baseline performance than the other models.


## Data

The data contains the following variables after it has been prepared for analysis in R:

* `correct` (logical): indicator that is `TRUE` if the predicted class label matches the true class label, else `FALSE`.

* `prediction` (int): predicted class label.

* `truth` (int): true class label.

* `uncertainty` (dbl): empirical standard deviation of softmax values for predicted class.

* `prob1` (dbl): maximum of the mean softmax output, i.e. the mean probability of the predicted class.

* `prob2` (dbl): mean probability of runner-up prediction.

* `class2` (int): class label of runner-up prediction.

* `logit_prob1` (dbl): logit transformation of `prob1`.

* `diff` (dbl): `prob1`-`prob2`

* `diff_sd_ratio` (dbl): `diff/uncertainty`.

All the variables above are pretty standard, with the exception of `diff_sd_ratio`. Intuitively, if `diff` is large, the averaged models all agree that class $k$ is the correct prediction. If `diff` is small, the models sampled by MC dropout don't agree on a single class. Thus `diff` also serves as a proxy for uncertainty. Model `uncertainty`, however, is approximated by the empirical standard deviation of the predictions for class $k$. Thus `diff_sd_ratio` is expressed by $$\tau_{jk} = \frac{\hat{\mu}_k - \hat{\mu}_j}{\hat{\sigma}_k}$$ where $j$ is the runner-up prediction. $\tau_{jk}$ gives us a ratio of two different measures of uncertainty.
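
To make the definition concrete, the following sketch computes the `diff_sd_ratio` statistic from a mean softmax output and the corresponding per-class standard deviations. The numbers are illustrative, not taken from the actual results:

```python
import numpy as np

def tau(probs_mean, pred_std):
    """diff/uncertainty ratio for the top two classes of a mean softmax output."""
    k, j = np.argsort(probs_mean)[::-1][:2]   # predicted class and runner-up
    return (probs_mean[k] - probs_mean[j]) / pred_std[k]

# Confident prediction: large diff, small spread -> tau well above 1
print(tau(np.array([0.9, 0.05, 0.05]), np.array([0.02, 0.01, 0.01])))

# Ambiguous prediction: small diff, large spread -> tau well below 1
print(tau(np.array([0.45, 0.40, 0.15]), np.array([0.20, 0.18, 0.05])))
```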

# Exploratory analysis

This section will be divided into two parts. First, we will examine the resulting data from `model55`, which will be regarded as our baseline model. Next, we will analyse the data from models obtained by varying kernel sizes and dropout rates.  

## Uncertainty analysis of baseline model {.tabset}

### Entire data set

```{r}
# Importing data
data <- as.tibble(read.csv("~/Documents/Masteroppgave/Data/Resultater/results_sgd_df_20180404.csv"))
df <- select(data, -X)

# Summarizing entire data set
summary(df)
```
For the entire set of classifications, we have the following notable quantities:

* The mean **uncertainty** of the predicted class is `r mean(df$uncertainty)` and the median is `r median(df$uncertainty)`. The minimum value is `r min(df$uncertainty)`, the maximum is `r max(df$uncertainty)`. The interquartile range (IQR) is `r IQR(df$uncertainty)`.

* The mean **softmax output of the predicted class** is `r mean(df$prob1)` and the median is `r median(df$prob1)`. The minimum value is `r min(df$prob1)`, the maximum is `r max(df$prob1)`. The IQR is `r IQR(df$prob1)`.

* The mean **softmax output of the runner-up** is `r mean(df$prob2)` and the median is `r median(df$prob2)`. The minimum value is `r min(df$prob2)`, the maximum is `r max(df$prob2)`. The IQR is `r IQR(df$prob2)`.

* The mean **difference** between the softmax outputs of the prediction and runner-up is `r mean(df$diff)` and the median is `r median(df$diff)`. The minimum value is `r min(df$diff)`, the maximum is `r max(df$diff)`. The IQR is `r IQR(df$diff)`.

* The mean **difference to uncertainty ratio**, or $\tau_{jk}$, is `r mean(df$diff_sd_ratio)` and the median is `r median(df$diff_sd_ratio)`. The minimum value is `r min(df$diff_sd_ratio)`, the maximum is `r max(df$diff_sd_ratio)`. The IQR is `r IQR(df$diff_sd_ratio)`.

### Summary statistics
```{r}
# Aggregating summary statistics by correct/incorrect
agg_df <- df %>% 
  group_by(correct) %>% 
  summarise(n=n(),
            mean_prob1=mean(prob1),
            mean_prob2=mean(prob2),
          mean_uncertainty=mean(uncertainty),
          median_uncertainty=median(uncertainty),
          sd_uncertainty=sd(uncertainty), 
          mean_diff=mean(diff),
          median_diff=median(diff),
          sd_diff=sd(diff),
          mean_ratio=mean(diff_sd_ratio),
          median_ratio=median(diff_sd_ratio),
          sd_ratio=sd(diff_sd_ratio))
agg_df
```
* The model has **incorrectly classified** `r agg_df$n[1]` images.

* The model has **correctly classified** `r agg_df$n[2]` images.

### Incorrect classifications
```{r}
# Summarizing incorrect predictions
incorrect_df <- df %>% 
  filter(correct==0)
summary(incorrect_df)
```
For the entire set of **incorrect classifications**, we have the following notable quantities:

* The mean **uncertainty** of the predicted class is `r mean(incorrect_df$uncertainty)` and the median is `r median(incorrect_df$uncertainty)`. The minimum value is `r min(incorrect_df$uncertainty)`, the maximum is `r max(incorrect_df$uncertainty)`. The interquartile range (IQR) is `r IQR(incorrect_df$uncertainty)`.

* The mean **softmax output of the predicted class** is `r mean(incorrect_df$prob1)` and the median is `r median(incorrect_df$prob1)`. The minimum value is `r min(incorrect_df$prob1)`, the maximum is `r max(incorrect_df$prob1)`. The IQR is `r IQR(incorrect_df$prob1)`.

* The mean **softmax output of the runner-up** is `r mean(incorrect_df$prob2)` and the median is `r median(incorrect_df$prob2)`. The minimum value is `r min(incorrect_df$prob2)`, the maximum is `r max(incorrect_df$prob2)`. The IQR is `r IQR(incorrect_df$prob2)`.

* The mean **difference** between the softmax outputs of the prediction and runner-up is `r mean(incorrect_df$diff)` and the median is `r median(incorrect_df$diff)`. The minimum value is `r min(incorrect_df$diff)`, the maximum is `r max(incorrect_df$diff)`. The IQR is `r IQR(incorrect_df$diff)`.

* The mean **difference to uncertainty ratio**, or $\tau_{jk}$, is `r mean(incorrect_df$diff_sd_ratio)` and the median is `r median(incorrect_df$diff_sd_ratio)`. The minimum value is `r min(incorrect_df$diff_sd_ratio)`, the maximum is `r max(incorrect_df$diff_sd_ratio)`. The IQR is `r IQR(incorrect_df$diff_sd_ratio)`.

### Correct classifications

```{r }
# Summarizing correct predictions
correct_df <- df %>% 
  filter(correct==1)
summary(correct_df)
```
For the entire set of **correct classifications**, we have the following notable quantities:

* The mean **uncertainty** of the predicted class is `r mean(correct_df$uncertainty)` and the median is `r median(correct_df$uncertainty)`. The minimum value is `r min(correct_df$uncertainty)`, the maximum is `r max(correct_df$uncertainty)`. The interquartile range (IQR) is `r IQR(correct_df$uncertainty)`.

* The mean **softmax output of the predicted class** is `r mean(correct_df$prob1)` and the median is `r median(correct_df$prob1)`. The minimum value is `r min(correct_df$prob1)`, the maximum is `r max(correct_df$prob1)`. The IQR is `r IQR(correct_df$prob1)`.

* The mean **softmax output of the runner-up** is `r mean(correct_df$prob2)` and the median is `r median(correct_df$prob2)`. The minimum value is `r min(correct_df$prob2)`, the maximum is `r max(correct_df$prob2)`. The IQR is `r IQR(correct_df$prob2)`.

* The mean **difference** between the softmax outputs of the prediction and runner-up is `r mean(correct_df$diff)` and the median is `r median(correct_df$diff)`. The minimum value is `r min(correct_df$diff)`, the maximum is `r max(correct_df$diff)`. The IQR is `r IQR(correct_df$diff)`.

* The mean **difference to uncertainty ratio**, or $\tau_{jk}$, is `r mean(correct_df$diff_sd_ratio)` and the median is `r median(correct_df$diff_sd_ratio)`. The minimum value is `r min(correct_df$diff_sd_ratio)`, the maximum is `r max(correct_df$diff_sd_ratio)`. The IQR is `r IQR(correct_df$diff_sd_ratio)`.

## Distribution of uncertainty {.tabset}

In the following we will visualize the relationships between our variables. We start by examining the empirical distribution of the uncertainty estimates $\hat{\sigma}_k$.

### Full distribution

The distribution appears to be bimodal, with peaks close to 0 and 0.2:

```{r fig.height=4, fig.width=6}
# Distribution of estimated uncertainty
p1 <- df %>% 
  ggplot(aes(x=uncertainty)) +
  geom_histogram(col="grey", bins = 50, alpha=.5) +
  ggtitle("Distribution of estimated uncertainty")
p1
```

### Uncertainty by prediction

By grouping the uncertainty estimates by `correct` (i.e. if the label was correctly predicted or not), we can find out how the predictions contribute to the uncertainty distribution.

The blue line corresponds to the correct predictions, the red line corresponds to incorrect predictions. We see that the incorrect predictions are centered around a higher associated uncertainty, whereas far more of the correctly predicted classes are concentrated around a low uncertainty value. The incorrect classifications greatly contribute to the bimodality, but it is also present in the distribution of uncertainty for the correct classifications. 

```{r}
#Distribution of estimated uncertainty by prediction
p2 <- df %>% 
  ggplot(aes(x=uncertainty, col=factor(correct))) +
  geom_freqpoly(alpha=.7) +
  ggtitle("Distribution of estimated uncertainty by classification") +
  scale_color_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct"))
p2
```

### Kernel density estimates

The following kernel density estimate plot (using a Gaussian kernel) gives us an idea of how the distributions compare to each other:

```{r}
# KDE by correct prediction
p3 <- df %>% 
  ggplot(aes(x=uncertainty)) +
  geom_density(data=incorrect_df, fill="red", alpha=I(.2)) +
  geom_density(data=correct_df, fill="turquoise", alpha=I(.2)) +
  ggtitle("Kernel density estimates of uncertainty distribution")
p3
```

### Boxplots

The boxplot gives us yet another way to visualize the difference between uncertainty distributions by predictive ability. It is interesting to note the number of outliers for the incorrect predictions, indicating some negative skew. These are predictions that are wrong, yet have low associated uncertainty.

```{r}
# Boxplot of uncertainties for correct vs. incorrect
p4 <- df %>% 
  ggplot(aes(x=factor(correct), y=uncertainty)) +
  geom_boxplot(aes(fill=factor(correct)), alpha=.7) +
  labs(x="correct") +
  ggtitle("Boxplot of uncertainty distribution by correct/incorrect") +
  scale_fill_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct"))
p4
```

## Relationship to other variables {.tabset}

### Softmax of predicted class

We may also be interested in the relationship between `uncertainty` and other variables. First, we plot `uncertainty` against `prob1` (the prediction's softmax output). The softmax outputs in the plot below are colour graded: outputs close to 1 are red, outputs close to 0 are blue. The plot shows a clear parabolic shape:

```{r}
# Plotting softmax output of prediction against estimated uncertainty
p5 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob1), alpha=.2) +
  ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
  labs(x="softmax output of predicted class") +
  scale_color_distiller(name="Softmax output",
                        palette = "Spectral")
p5
```

### By classification
Red points indicate incorrect classifications, blue points indicate correct classifications.
```{r}
p6 <- df %>% 
  mutate(correct=as.logical(correct)) %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(fill=correct, col=correct), shape=21, alpha=.5) +
  ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
  labs(x="softmax output of predicted class")
p6
```


### Predictions and runners-up

We can obtain more information by colouring the points by the value of the runner-up predictions. This plot is particularly interesting. The softmax outputs of the runner-up classes $\hat{\mu}_j$ in the plot below are colour graded: $\hat{\mu}_j \approx 0.5$ are red, $\hat{\mu}_j \approx 0$ are blue. The round point marks the pair of mean values of (`prob1`, `prob2`) for the incorrect predictions; the triangular point marks the pair of mean values for the correct predictions.

We see a clear concentration of red points in the area where the probability of the predicted class $\hat{\mu}_k \approx 0.5$. If the softmax outputs of both the predicted class and the runner-up are close to 0.5, then we have a situation analogous to maximum entropy. These points coincide with the largest approximated uncertainty values. 

```{r}
# Plotting softmax output of prediction against estimated uncertainty, coloured by softmax output of runner-up
p7 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2), shape=21, alpha=.7) +
  geom_point(data=agg_df, aes(x=mean_prob1, y=mean_prob2, shape=as.logical(correct)), size=2.5) +
  #geom_point(data=agg_df, aes(y=uncertainty, x=mean_prob1, shape=as.logical(correct)), size=2) +
  ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
  labs(x="softmax output of predicted class", shape="correct") +
  scale_color_distiller(name="runner up",
                        palette = "Spectral")
p7
```
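
The entropy analogy can be made concrete: a mean softmax output that splits its mass evenly between the predicted class and the runner-up carries far more entropy than a confident one. A small sketch with made-up probability vectors (for 10 classes, the uniform distribution would maximize entropy; the 0.5/0.5 split maximizes it among top-two splits):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; 0 * log(0) is taken as 0."""
    p = np.asarray(p)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

confident = [0.95] + [0.05 / 9] * 9   # one dominant class
ambiguous = [0.5, 0.5] + [0.0] * 8    # prediction and runner-up split the mass

print(entropy(confident))             # lower
print(entropy(ambiguous))             # log(2), higher
```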

### Softmax of prediction by classification

In the following plot we split the observations by incorrect/correct predictions, and plot the values of uncertainty against the softmax output of the predicted class:

```{r fig.width=10, fig.height=5}
# Plotting uncertainty vs. softmax output of prediction
p8a <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2), shape=21, alpha=.7) +
  ggtitle("Uncertainty vs. softmax output of prediction by incorrect/correct") +
  labs(x="softmax output of prediction") +
  facet_grid(.~factor(correct)) +
  scale_color_distiller(name="Runner up",
                        palette = "Spectral")
p8a
```
### Softmax of prediction by classification with contours

The black contour lines indicate where most of the points are concentrated. The plot on the left is for incorrect predictions. The right hand plot represents the correct predictions. For the correct predictions, it seems as if far more of the points are concentrated around high predicted output/low runner-up output/low uncertainty. This is not surprising, considering 75% of the correct predictions have a softmax value of approximately 0.7 or above. For the incorrect classifications, most of the points are concentrated around the area of maximum entropy. This indicates that the approximated uncertainty estimates indeed contain valuable information in the incorrect cases.
```{r fig.width=10, fig.height=5}
# Plotting uncertainty vs. softmax output of prediction
p8 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2), shape=21, alpha=.7) +
  geom_density_2d(col="black", alpha=.3) +
  ggtitle("Uncertainty vs. softmax output of prediction by incorrect/correct") +
  labs(x="softmax output of prediction") +
  facet_grid(.~factor(correct)) +
  scale_color_distiller(name="Runner up",
                        palette = "Spectral")
p8
```

### Runner-up predictions

The following plot shows the relationship between uncertainty estimates and the softmax output of the runner-up prediction. Unsurprisingly, model uncertainty increases as the softmax output of the runner-up increases. We have plotted a LOESS estimate of the mean uncertainty as a function of the runner-up output to make this clearer:

```{r}
# Plotting softmax output of runner-up against estimated uncertainty
p10 <- df %>% 
  ggplot(aes(x=prob2, y=uncertainty)) +
  geom_point(alpha=.2) +
  geom_smooth(method="loess") +
  labs(x="softmax output of runner-up") +
  ggtitle("Uncertainty vs. softmax output of runner-up")
p10
```

## Images associated with high uncertainty {.tabset}

The following plots were generated using Python (code available on [GitHub](https://github.com/smu095/smu095.github.io/blob/master/dat259.ipynb)). On the left hand side we see the unnormalized image with the corresponding ground truth label. The plot in the middle shows the softmax output of the predicted class for each of the $T=100$ stochastic forward passes. $\mu_k$ is given by the solid red line, $\mu_j$ is given by the dashed brown line. The plot title shows both the predicted class and the runner-up class. To the right is a kernel density estimate (using a Gaussian kernel) of the $T=100$ softmax outputs for the predicted class. The plots show the top 5 most uncertain classifications in the entire data set.

**NOTE:** Add new images.

### Incorrect

![](Figures/4442_tf.jpeg)
![](Figures/3287_tf.jpeg)
![](Figures/8449_tf.jpeg)
![](Figures/5838_tf.jpeg)
![](Figures/3569_tf.jpeg)

### Correct
![](Figures/1223_tf.jpeg)
![](Figures/6010_tf.jpeg)
![](Figures/4468_tf.jpeg)
![](Figures/9816_tf.jpeg)
![](Figures/7486_tf.jpeg)

## Uncertainty-prediction correlation

As mentioned in section 2.2, the connection established between dropout neural networks and GPs is lost when applied to convolutional neural networks. Performing a logistic regression gives us a simple way of testing whether the approximated uncertainty is a significant predictor of the model's ability to predict correctly.

The coefficient associated with `uncertainty` is highly significant, indicating that the estimates gathered from performing MC dropout are indeed a useful quantification of predictive uncertainty. Given a unit increase in uncertainty, we expect the probability of predicting correctly to decrease.

```{r}
# Fitting logistic regression model to check significance of uncertainty
model_sd <- glm(factor(correct)~uncertainty, data=df, family = binomial(link="logit"))
summary(model_sd)
```
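
To see what such a coefficient implies mechanically, recall that logistic regression models the log-odds of `correct` as linear in `uncertainty`, so a negative coefficient makes the predicted probability decrease monotonically. The sketch below uses made-up coefficients `b0` and `b1`, not the fitted values from the chunk above:

```python
import numpy as np

def p_correct(uncertainty, b0=2.0, b1=-10.0):
    """Inverse logit of a hypothetical fit: log-odds = b0 + b1 * uncertainty."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * uncertainty)))

# Probability of a correct prediction falls as uncertainty grows
for u in [0.0, 0.1, 0.2, 0.3]:
    print(u, round(float(p_correct(u)), 3))
```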

## Referral criteria

The question is then: How do we determine a reasonable uncertainty value for referral to a human expert? As we saw in section 3.1.2, the mean uncertainty of the incorrect predictions is higher than that of the correct predictions (`r agg_df$mean_uncertainty[1]` vs. `r agg_df$mean_uncertainty[2]`, respectively).

Naively, we could set the threshold for referral to the mean uncertainty of the incorrectly classified images:

```{r}
# Counting number of correct/incorrect by uncertainty >= .207 
referral_mean <- df %>% 
  filter(uncertainty>=.207) %>% 
  count(correct)
referral_mean
```

The problem here is apparent: Many of the correctly classified images are associated with a relatively high level of uncertainty (in our case, this is due to the bimodality of the uncertainty distribution in section 3.2.1). The proportion of incorrectly classified images in the referred sample is `r referral_mean$n[1]/sum(referral_mean$n)`. 

The result of our logistic regression suggests that uncertainty is a significant predictor of a model's ability to predict correctly. However, although uncertainty seems to contain valuable information, the relative uncertainties of the correct/incorrect observations (in this case) may be too small to differentiate which images should be referred to an expert. We may need to amplify the quantification of uncertainty in some way.

### Usefulness of runner-up predictions

For the most uncertain incorrectly classified images in section 3.4.1, the runner-up suggestions are in fact the ground truth labels 80% of the time. This may be due to chance, but still raises the question: Is there any information to be obtained from the runner-up predictions? What would happen to our overall accuracy if we used the runner-up predictions for all incorrect classifications?

```{r}
# Counting classification accuracy if runner-up is equal to ground truth
class2_df <- df %>%
  mutate(correct=replace(correct, correct==0 & class2==truth, 1)) %>% 
  count(correct)
class2_df
```

Accuracy would rise to `r class2_df$n[2]/(sum(class2_df$n))`. This indicates that there may be some valuable information to be gathered from the runner-up predictions.

### A revised quantification of uncertainty

One possible approach is to use the value of $\tau_{jk}$ introduced in section 2.5. Recall that $\tau_{jk}$ incorporates information about the runner-up prediction into our quantification of uncertainty. Consider the following cases:

* $\tau_{jk} \approx 1$ means that `diff` and `uncertainty` are relatively similar. This happens if a) the models have failed to reach a consensus (`diff` is small) but model uncertainty is low, or b) the models have reached a consensus (`diff` is large) but model uncertainty is high. Let's call these **referral predictions**.

* $\tau_{jk} \to 0$ means that `uncertainty` is much larger than `diff`. These should represent **uncertain predictions**.

* $\tau_{jk} \to \infty$ means that `uncertainty` is much smaller than `diff`. These should represent **non-referral predictions**.
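
The three regimes can be sketched as a simple decision rule. The cut-offs `low` and `high` below are hypothetical, chosen only to illustrate the idea, and are not thresholds used in this analysis:

```python
def referral_status(diff, uncertainty, low=0.5, high=2.0):
    """Classify a prediction by its diff/uncertainty ratio tau.
    The low/high cut-offs are illustrative placeholders."""
    tau = diff / uncertainty
    if tau < low:
        return "uncertain"       # uncertainty dominates diff
    if tau > high:
        return "non-referral"    # models agree and spread is small
    return "referral"            # diff and uncertainty are comparable

print(referral_status(diff=0.05, uncertainty=0.25))  # uncertain
print(referral_status(diff=0.30, uncertainty=0.28))  # referral
print(referral_status(diff=0.80, uncertainty=0.05))  # non-referral
```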

As we saw in table 3.1.2, the mean $\tau_{jk}$ for the incorrectly classified images is `r agg_df$mean_ratio[1]` and `r agg_df$mean_ratio[2]` for the correctly classified images. The respective medians are `r agg_df$median_ratio[1]` and `r agg_df$median_ratio[2]`.

As a first step, setting the referral threshold to $\tau_{jk} \leq 1.83$ (the mean $\tau_{jk}$ of the incorrect classifications) gives:

```{r}
# Counting number of correct/incorrect by tau <= 1.83
referral_meantau <- df %>% 
  filter(diff_sd_ratio <= 1.83) %>% 
  count(correct)
referral_meantau
```

We run into the same problem as when we used the mean uncertainty: Many of the correctly predicted images would be referred. It is interesting to note, however, that we get more referrals of incorrectly classified images. In this case, the proportion of incorrectly classified images in the referred sample is `r referral_meantau$n[1]/sum(referral_meantau$n)`. 

Upon inspection, the boxplot of the log-transformed $\tau_{jk}$ clearly shows the presence of extreme values (outliers marked in red) in both tails. For incorrect predictions, the distribution seems to be negatively skewed. The distribution of the correct predictions appears to be positively skewed. 

```{r}
pt <- df %>% 
  ggplot(aes(x=factor(correct), y=log(diff_sd_ratio))) +
  geom_boxplot(aes(fill=factor(correct)), outlier.colour = "red", outlier.alpha = .15) +
  xlab("correct") +
  ylab("log-transformed tau")  +
  ggtitle("Presence of extreme uncertainty values") +
  labs(fill="correct")
pt
```


Perhaps a more robust measure of the central tendency of $\tau_{jk}$ is the median. Setting the referral threshold to the median of the incorrect classifications, $\tau_{jk} \leq 0.965$, gives the following:

```{r}
# Counting number of correct/incorrect by tau <= 0.965
referral_medtau <- df %>% 
  filter(diff_sd_ratio <= .965) %>% 
  count(correct)
referral_medtau
```

This is a clear improvement over the mean approach in terms of the number of referred images. `r abs(referral_medtau$n[1]-referral_meantau$n[1])` fewer incorrect images are referred compared to `r abs(referral_medtau$n[2]-referral_meantau$n[2])` fewer correct images. In this case, the proportion of incorrectly classified images in the referred sample is `r referral_medtau$n[1]/sum(referral_medtau$n)`.

When compared to using the mean of $\hat{\sigma}_k$ as a cut-off for referral, the median of $\tau_{jk}$ results in almost 20% more referrals of incorrect images relative to the sample size. This could indicate that $\tau_{jk}$ is a more sensitive uncertainty quantification for practical applications.

## Comparison of tau and uncertainty {.tabset}

We can examine this further by plotting the number of correct/incorrect referrals against different values for $\tau_{jk}$ and $\sigma_k$.

### Data preparation
First we need to prepare the data for plotting. In the following chunk we count the number of referrals of correctly/incorrectly classified images by increasing the values of $\tau_{jk}$ and $\sigma_k$:

* For $\tau_{jk}$, we set the initial value $\tau_1 = 0.05$ and increment by 0.1 until we reach a maximum value of 10. This cut-off value was chosen because it gives us roughly the same proportion of correct vs. incorrect classifications as our threshold for $\sigma_k$. For every $\tau_{jk} \leq \tau_i$ we count the number of correct and incorrect classifications among the referred images. Finally, we calculate the relative proportions of correctly/incorrectly classified images per value.

* For $\sigma_k$, we set the initial value to $\sigma_1=0.01$ and increment by 0.01 until we reach a maximum value of 0.35. This cut-off value was chosen because it gives us roughly the same proportion of correct vs. incorrect classifications as our threshold for $\tau_{jk}$. For every $\sigma_{k} \geq \sigma_i$ (i.e. all images that have higher uncertainty than this value) we count the number of correct and incorrect classifications among the referred images.

```{r}
# Preparing data
count_df <- df %>% 
  rename(tau=diff_sd_ratio) %>% 
  select(correct, uncertainty, tau)

# Gathering number of referrals by tau
roc_tau <- data.frame(n_false=numeric(), n_true=numeric(), tau_value=numeric())
idx=1
tau_seq <- seq(0.05, 10, by=.1)

for(value in tau_seq){
  get_results <- count_df %>%
    filter(tau<=value) %>% 
    count(correct)
  roc_tau[idx,] <- c(get_results$n, value)
  idx <- idx + 1
}

plot_tau <- roc_tau %>% 
  gather(status, n, -tau_value) %>% 
  group_by(tau_value) %>% 
  mutate(p=n/sum(n)) %>% 
  ungroup()

# Gathering number of referrals by uncertainty
roc_unc <- data.frame(n_false=numeric(), n_true=numeric(), uncertainty_value=numeric())
idx=1
uncertainty_seq <- seq(0.01, .35, by=.01)

for(value in uncertainty_seq){
  get_results <- count_df %>%
    filter(uncertainty>=value) %>% 
    count(correct)
  roc_unc[idx,] <- c(get_results$n, value)
  idx <- idx + 1
}

plot_unc <- roc_unc %>% 
  gather(status, n, -uncertainty_value) %>% 
  group_by(uncertainty_value) %>% 
  mutate(p=n/sum(n)) %>% 
  ungroup()
```

### Plotting
The first row of the plot contains the referrals for varying values of $\sigma_k$. The second row shows referral counts for varying values of $\tau_{jk}$. The first column shows the absolute counts (note the different x-axes) and the second column shows the relative referral rates. The dashed vertical lines represent the mean of $\sigma_k$ and the median of $\tau_{jk}$, respectively, for the incorrectly classified images.

```{r fig.height=6, fig.width=8}
# Absolute counts
p1a <- ggplot(plot_unc, aes(x=uncertainty_value, y=n, group=status)) +
  geom_line(aes(col=status)) +
  geom_vline(xintercept=agg_df$mean_uncertainty[1], lwd=.3, lty="dashed") +
  labs(x="uncertainty", y="number of referrals") +
  ggtitle("Absolute referral counts") +
  theme(legend.position = "none")

p2a <- ggplot(plot_tau, aes(x=tau_value, y=n, group=status)) +
  geom_line(aes(col=status)) +
  geom_vline(xintercept=agg_df$median_ratio[1], lwd=.3, lty="dashed") +
  labs(x="tau", y="number of referrals") +
  ggtitle("Absolute referral counts")+
  theme(legend.position = "none")

# Relative counts
p1b <- ggplot(plot_unc, aes(x=uncertainty_value, y=p, group=status)) +
  geom_line(aes(col=status)) +
  labs(x="uncertainty", y="proportion of referrals") +
  geom_vline(xintercept=agg_df$mean_uncertainty[1], lwd=.3, lty="dashed") +
  ggtitle("Relative referral rates")+
  theme(legend.position = "none")

p2b <- ggplot(plot_tau, aes(x=tau_value, y=p, group=status)) +
  geom_line(aes(col=status)) +
  labs(x="tau", y="proportion of referrals") +
  geom_vline(xintercept=agg_df$median_ratio[1], lwd=.3, lty="dashed") +
  ggtitle("Relative referral rates")+
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4), nrow = 2, ncol = 2, byrow=TRUE)
multiplot(p1a, p1b, p2a, p2b, layout = layout)
```

### Tau vs. uncertainty

Plotting the absolute referral counts for varying values of $\tau_{jk}$ (blue line) and $\sigma_k$ (red line) in the same plot tells us that $\tau_{jk}$ gives us more "bang for the buck", so to speak.

```{r}
p3a <- ggplot(roc_tau, aes(x=n_true, y=n_false)) +
  geom_line(col="blue") +
  geom_line(data=roc_unc, aes(x=n_true, y=n_false), col="red") +
  ylim(c(0,2000)) +
  xlim(c(0,8000))
p3a
```


# Digit classification on MNIST

**Draft:** Checking to see if $\tau_{jk}$ is useful on a completely different data set (MNIST).

```{r}
# Importing MNIST data
data_mnist <- as.tibble(read.csv("~/Documents/Masteroppgave/Data/Resultater/mnist_df.csv"))
df_mnist <- select(data_mnist, -X)

# Summarizing entire data set
summary(df_mnist)
```

```{r}
# Aggregating summary statistics by correct/incorrect
agg_df_mnist <- df_mnist %>% 
  group_by(correct) %>% 
  summarise(n=n(),
            mean_prob1=mean(prob1),
            mean_prob2=mean(prob2),
          mean_uncertainty=mean(uncertainty),
          median_uncertainty=median(uncertainty),
          sd_uncertainty=sd(uncertainty), 
          mean_diff=mean(diff),
          median_diff=median(diff),
          sd_diff=sd(diff),
          mean_tau=mean(diff_sd_ratio),
          median_tau=median(diff_sd_ratio),
          sd_ratio=sd(diff_sd_ratio))
agg_df_mnist
```

```{r}
# Finding cases where runner up equals true class
medtau <- df_mnist %>% 
  filter(class2 == truth) %>% 
  summarise(n = n(),
    mean_uncertainty = mean(uncertainty),
            median_uncertainty = median(uncertainty),
            mean_tau = mean(diff_sd_ratio),
            median_tau = median(diff_sd_ratio))
medtau
```

```{r}
# Setting referral threshold to the mean uncertainty of incorrect predictions
referral_mnist_sigma <- df_mnist %>% 
  filter(uncertainty >= .22) %>% 
  count(correct)
referral_mnist_sigma
```
In this case, the proportion of incorrectly classified images in the referred sample is `r referral_mnist_sigma$n[1]/sum(referral_mnist_sigma$n)`.

```{r}
# Counting number of correct/incorrect by tau <= 1.36
referral_mnist_tau <- df_mnist %>% 
  filter(diff_sd_ratio <= 1.36) %>% 
  count(correct)
referral_mnist_tau
```

In this case, the proportion of incorrectly classified images in the referred sample is `r referral_mnist_tau$n[1]/sum(referral_mnist_tau$n)`. Again, preliminary results suggest that runner-up predictions can be leveraged to produce sensible cut-off values for referral.